Loop Tiling in Large-Scale Stencil Codes at Run-Time with OPS
Similar resources
Supplementary Material: Loop Tiling in Large-Scale Stencil Codes at Run-time with OPS
if(dir == g_xdir) { if(sweep_number == 1) { ops_par_loop(advec_cell_kernel1_xdir, "advec_cell_kernel1_xdir", clover_grid, 2, rangexy, ops_arg_dat(work_array1, 1, S2D_00, "double", OPS_WRITE), ops_arg_dat(work_array2, 1, S2D_00, "double", OPS_WRITE), ops_arg_dat(volume, 1, S2D_00, "double", OPS_READ), ops_arg_dat(vol_flux_x, 1, S2D_00_P10, "double", OPS_READ), ops_arg_dat(vol_flux_y, 1, S2D_00_0...
Writing productive stencil codes with overlapped tiling
Stencil computations constitute the kernel of many scientific applications. Tiling is often used to improve the performance of stencil codes for data locality and parallelism. However, tiled stencil codes typically require shadow regions, whose management becomes a burden to programmers. In fact, it is often the case that the code required to manage these regions, and in particular their ...
Real-Time Large-Scale Dense 3D Reconstruction with Loop Closure
In the highly active research field of dense 3D reconstruction and modelling, loop closure is still a largely unsolved problem. While a number of previous works show how to accumulate keyframes, globally optimize their pose on closure, and compute a dense 3D model as a post-processing step, in this paper we propose an online framework which delivers a consistent 3D model to the user in real tim...
Run-time thread management for large-scale distributed-memory multiprocessors
Effective thread management is crucial to achieving good performance on large-scale distributed-memory multiprocessors that support dynamic threads. For a given parallel computation with some associated task graph, a thread-management algorithm produces a running schedule as output, subject to the precedence constraints imposed by the task graph and the constraints imposed by the interprocessor ...
Code Refinement of Stencil Codes
A straightforward implementation of an algorithm in a general-purpose programming language does usually not deliver peak performance: Compilers often fail to automatically tune the code for certain hardware peculiarities like memory hierarchy or vector execution units. Manually tuning the code is firstly error-prone as well as time-consuming and secondly taints the code by exposing those peculi...
Journal
Journal title: IEEE Transactions on Parallel and Distributed Systems
Year: 2018
ISSN: 1045-9219
DOI: 10.1109/tpds.2017.2778161